{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "LtqlhvqcAvlN"
   },
   "source": [
    "# Data Preprocessing\n",
    "\n",
    "Preprocessing is the process of transforming the raw data into something that the machine learning algorithm can understand. Within this context, most algorithms will demand that you specify what “type” of data a feature is. We quickly review the different types of features and how to encode them in pandas/sklearn below. Note that **transformations should always be applied separately on the training and test data** since calculating statistics on the test data is considered a form of “cheating”(data leakage). While we give alternatives for both pandas and sklearn, it is usually more convenient to work with sklearn for this reason."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "12343\n"
     ]
    }
   ],
   "source": [
    "print(12343)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "HrMW5-BGAvlT"
   },
   "source": [
    "## Handling missing data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "BvdcVHmDAvlU"
   },
   "source": [
    "Missing data is one of the most annoying, but also most common problems in any machine learning project. Your algorithms will assume your data matrices to be completely filled with numerical values so we need to somehow ensure this requirement. There are two basic strategies here: (a) dropping features or observations with many missing values and (b) imputing the values. There are various imputation strategies, but they all replace the missing value with some statistic that is calculated on the other rows for this particular feature. We may for instance replace a missing value with the mean of that feature or the most frequent value (the mode)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 391
    },
    "id": "UBLjyNRwAvlU",
    "outputId": "7f92b1b8-8b48-467c-abc2-ee49931be3a8"
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>0</th>\n",
       "      <th>1</th>\n",
       "      <th>2</th>\n",
       "      <th>3</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>5.1</td>\n",
       "      <td>3.5</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>4.9</td>\n",
       "      <td>3.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>4.7</td>\n",
       "      <td>3.2</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>4.6</td>\n",
       "      <td>3.1</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>5.0</td>\n",
       "      <td>3.6</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.2</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     0    1   2    3\n",
       "0  5.1  3.5 NaN  0.2\n",
       "1  4.9  3.0 NaN  0.2\n",
       "2  4.7  3.2 NaN  0.2\n",
       "3  4.6  3.1 NaN  0.2\n",
       "4  5.0  3.6 NaN  0.2"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>0</th>\n",
       "      <th>1</th>\n",
       "      <th>2</th>\n",
       "      <th>3</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>7.7</td>\n",
       "      <td>3.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>2.3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>6.3</td>\n",
       "      <td>3.4</td>\n",
       "      <td>NaN</td>\n",
       "      <td>2.4</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>6.4</td>\n",
       "      <td>3.1</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1.8</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>6.0</td>\n",
       "      <td>3.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1.8</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>6.9</td>\n",
       "      <td>3.1</td>\n",
       "      <td>NaN</td>\n",
       "      <td>2.1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     0    1   2    3\n",
       "0  7.7  3.0 NaN  2.3\n",
       "1  6.3  3.4 NaN  2.4\n",
       "2  6.4  3.1 NaN  1.8\n",
       "3  6.0  3.0 NaN  1.8\n",
       "4  6.9  3.1 NaN  2.1"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "#Preparing Dataset\n",
    "import pandas as pd\n",
    "import sklearn.datasets\n",
    "import numpy as np\n",
    "from sklearn.model_selection import train_test_split\n",
    "\n",
    "iris = sklearn.datasets.load_iris()\n",
    "X = iris.data   \n",
    "y = iris.target\n",
    "\n",
    "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, shuffle=False)\n",
    "\n",
    "#Create some missing values\n",
    "X_train[:5, 2] = np.nan\n",
    "X_test[:5, 2] = np.nan\n",
    "display(pd.DataFrame(X_train).head())\n",
    "display(pd.DataFrame(X_test).head())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 391
    },
    "id": "UcPG8540AvlW",
    "outputId": "bef22fbb-932c-432c-9af7-5e29f3638998"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "hello\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>0</th>\n",
       "      <th>1</th>\n",
       "      <th>2</th>\n",
       "      <th>3</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>5.1</td>\n",
       "      <td>3.5</td>\n",
       "      <td>1.5</td>\n",
       "      <td>0.2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>4.9</td>\n",
       "      <td>3.0</td>\n",
       "      <td>1.5</td>\n",
       "      <td>0.2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>4.7</td>\n",
       "      <td>3.2</td>\n",
       "      <td>1.5</td>\n",
       "      <td>0.2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>4.6</td>\n",
       "      <td>3.1</td>\n",
       "      <td>1.5</td>\n",
       "      <td>0.2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>5.0</td>\n",
       "      <td>3.6</td>\n",
       "      <td>1.5</td>\n",
       "      <td>0.2</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     0    1    2    3\n",
       "0  5.1  3.5  1.5  0.2\n",
       "1  4.9  3.0  1.5  0.2\n",
       "2  4.7  3.2  1.5  0.2\n",
       "3  4.6  3.1  1.5  0.2\n",
       "4  5.0  3.6  1.5  0.2"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>0</th>\n",
       "      <th>1</th>\n",
       "      <th>2</th>\n",
       "      <th>3</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>7.7</td>\n",
       "      <td>3.0</td>\n",
       "      <td>1.5</td>\n",
       "      <td>2.3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>6.3</td>\n",
       "      <td>3.4</td>\n",
       "      <td>1.5</td>\n",
       "      <td>2.4</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>6.4</td>\n",
       "      <td>3.1</td>\n",
       "      <td>1.5</td>\n",
       "      <td>1.8</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>6.0</td>\n",
       "      <td>3.0</td>\n",
       "      <td>1.5</td>\n",
       "      <td>1.8</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>6.9</td>\n",
       "      <td>3.1</td>\n",
       "      <td>1.5</td>\n",
       "      <td>2.1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     0    1    2    3\n",
       "0  7.7  3.0  1.5  2.3\n",
       "1  6.3  3.4  1.5  2.4\n",
       "2  6.4  3.1  1.5  1.8\n",
       "3  6.0  3.0  1.5  1.8\n",
       "4  6.9  3.1  1.5  2.1"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "from sklearn.impute import SimpleImputer\n",
    "\n",
    "imputer = SimpleImputer(strategy=\"most_frequent\")\n",
    "\n",
    "# Strategy avail: mean median most_frequent\n",
    "print(\"hello\")\n",
    "\n",
    "imputer.fit(X_train)\n",
    "X_train = imputer.transform(X_train)\n",
    "X_test = imputer.transform(X_test)\n",
    "\n",
    "display(pd.DataFrame(X_train).head())\n",
    "display(pd.DataFrame(X_test).head())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "g93Nz_5yAvlW"
   },
   "source": [
    "### KNN Imputation\n",
    "KNN imputations imputes missing values using the weighted or unweighted mean of the desired number of nearest neighbors. In practice, KNN "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 765
    },
    "id": "kszNK4jUAvlW",
    "outputId": "45c63923-8387-4384-d713-8f56c45ba539",
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>0</th>\n",
       "      <th>1</th>\n",
       "      <th>2</th>\n",
       "      <th>3</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>5.1</td>\n",
       "      <td>3.5</td>\n",
       "      <td>1.5</td>\n",
       "      <td>0.2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>4.9</td>\n",
       "      <td>3.0</td>\n",
       "      <td>1.5</td>\n",
       "      <td>0.2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>4.7</td>\n",
       "      <td>3.2</td>\n",
       "      <td>1.5</td>\n",
       "      <td>0.2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>4.6</td>\n",
       "      <td>3.1</td>\n",
       "      <td>1.5</td>\n",
       "      <td>0.2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>5.0</td>\n",
       "      <td>3.6</td>\n",
       "      <td>1.5</td>\n",
       "      <td>0.2</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     0    1    2    3\n",
       "0  5.1  3.5  1.5  0.2\n",
       "1  4.9  3.0  1.5  0.2\n",
       "2  4.7  3.2  1.5  0.2\n",
       "3  4.6  3.1  1.5  0.2\n",
       "4  5.0  3.6  1.5  0.2"
      ]
     },
     "metadata": {
      "tags": []
     },
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>0</th>\n",
       "      <th>1</th>\n",
       "      <th>2</th>\n",
       "      <th>3</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>7.7</td>\n",
       "      <td>3.0</td>\n",
       "      <td>1.5</td>\n",
       "      <td>2.3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>6.3</td>\n",
       "      <td>3.4</td>\n",
       "      <td>1.5</td>\n",
       "      <td>2.4</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>6.4</td>\n",
       "      <td>3.1</td>\n",
       "      <td>1.5</td>\n",
       "      <td>1.8</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>6.0</td>\n",
       "      <td>3.0</td>\n",
       "      <td>1.5</td>\n",
       "      <td>1.8</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>6.9</td>\n",
       "      <td>3.1</td>\n",
       "      <td>1.5</td>\n",
       "      <td>2.1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     0    1    2    3\n",
       "0  7.7  3.0  1.5  2.3\n",
       "1  6.3  3.4  1.5  2.4\n",
       "2  6.4  3.1  1.5  1.8\n",
       "3  6.0  3.0  1.5  1.8\n",
       "4  6.9  3.1  1.5  2.1"
      ]
     },
     "metadata": {
      "tags": []
     },
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>0</th>\n",
       "      <th>1</th>\n",
       "      <th>2</th>\n",
       "      <th>3</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>5.1</td>\n",
       "      <td>3.5</td>\n",
       "      <td>1.5</td>\n",
       "      <td>0.2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>4.9</td>\n",
       "      <td>3.0</td>\n",
       "      <td>1.5</td>\n",
       "      <td>0.2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>4.7</td>\n",
       "      <td>3.2</td>\n",
       "      <td>1.5</td>\n",
       "      <td>0.2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>4.6</td>\n",
       "      <td>3.1</td>\n",
       "      <td>1.5</td>\n",
       "      <td>0.2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>5.0</td>\n",
       "      <td>3.6</td>\n",
       "      <td>1.5</td>\n",
       "      <td>0.2</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     0    1    2    3\n",
       "0  5.1  3.5  1.5  0.2\n",
       "1  4.9  3.0  1.5  0.2\n",
       "2  4.7  3.2  1.5  0.2\n",
       "3  4.6  3.1  1.5  0.2\n",
       "4  5.0  3.6  1.5  0.2"
      ]
     },
     "metadata": {
      "tags": []
     },
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>0</th>\n",
       "      <th>1</th>\n",
       "      <th>2</th>\n",
       "      <th>3</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>7.7</td>\n",
       "      <td>3.0</td>\n",
       "      <td>1.5</td>\n",
       "      <td>2.3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>6.3</td>\n",
       "      <td>3.4</td>\n",
       "      <td>1.5</td>\n",
       "      <td>2.4</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>6.4</td>\n",
       "      <td>3.1</td>\n",
       "      <td>1.5</td>\n",
       "      <td>1.8</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>6.0</td>\n",
       "      <td>3.0</td>\n",
       "      <td>1.5</td>\n",
       "      <td>1.8</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>6.9</td>\n",
       "      <td>3.1</td>\n",
       "      <td>1.5</td>\n",
       "      <td>2.1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     0    1    2    3\n",
       "0  7.7  3.0  1.5  2.3\n",
       "1  6.3  3.4  1.5  2.4\n",
       "2  6.4  3.1  1.5  1.8\n",
       "3  6.0  3.0  1.5  1.8\n",
       "4  6.9  3.1  1.5  2.1"
      ]
     },
     "metadata": {
      "tags": []
     },
     "output_type": "display_data"
    }
   ],
   "source": [
    "from sklearn.impute import KNNImputer\n",
    "\n",
    "display(pd.DataFrame(X_train).head())\n",
    "display(pd.DataFrame(X_test).head())\n",
    "\n",
    "knnImputer = KNNImputer(n_neighbors=5, weights=\"uniform\")\n",
    "knnImputer.fit(X_train)\n",
    "X_train = knnImputer.transform(X_train)\n",
    "X_test = knnImputer.transform(X_test)\n",
    "\n",
    "display(pd.DataFrame(X_train).head())\n",
    "display(pd.DataFrame(X_test).head())\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "Nv7rKg53AvlX"
   },
   "source": [
    "Another possible way to impute missing value is to train a regression model which estimates the missing value based on other variables. You could do that with sklearn.impute.IterativeImputer."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "UsIcYQ5nAvlX"
   },
   "source": [
    "Note: for categorical values (see later), it is often not a good to impute values. Instead, you will want to treat “missing”as a special category/level."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "C2iUm-QqAvlY"
   },
   "source": [
    "## Feature creation"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "c4DnNGnzAvlY"
   },
   "source": [
    "At this point, it may be useful to engineer your own features. You may want to take the log of some variable or perhaps add polynomial features as we saw in class:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true,
    "id": "joYIjh10AvlY"
   },
   "outputs": [],
   "source": [
    "from sklearn.preprocessing import PolynomialFeatures\n",
    "\n",
    "poly = PolynomialFeatures(3).fit(X_train)\n",
    "X_train_poly = poly.transform(X_train)\n",
    "X_test_poly = poly.transform(X_test)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "t5Fq-UHRAvlY"
   },
   "source": [
    "Beyond these generic things, you may also want to start thinking about adding external data that is relevant to your learning problem, for instance:\n",
    "* predicting insurance claims: add weather forecasts\n",
    "* ice cream sales: add the temperature of the day as a variable\n",
    "* beer company: add world cup activity data\n",
    "* stock prediction: add whether or not a big news event happened on that day \n",
    "* ..."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "TQfMJo7IAvlZ"
   },
   "source": [
    "## Vectorizing categorical data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "XGDNC0_eAvlZ"
   },
   "source": [
    "Categorical variables usually describe traits and are often encoded as either text strings or as integers. Examples include size (“Small”,“Medium”,“Large”), country (“China”, “USA”, “Belgium”, ...), etc. We call the number of values a categorical variable can have the **cardinality** of the variable. A special type of categorical variables are **ordinal** variables which are ordered in addition to being categorical (e.g., size is an ordinal variable, country is not). Sometimes categorical variables are created for historical or privacy reasons (e.g., bucketing of salaries into brackets) though this is often not beneficial for learning performance. In pandas we can register a variable as being categorical by casting it as such:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 425
    },
    "id": "aNZ0gkvcAvlZ",
    "outputId": "a2052de9-9715-4069-bb27-538353de70ee"
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>sepal_length</th>\n",
       "      <th>sepal_width</th>\n",
       "      <th>petal_length</th>\n",
       "      <th>petal_width</th>\n",
       "      <th>species</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>5.1</td>\n",
       "      <td>3.5</td>\n",
       "      <td>1.4</td>\n",
       "      <td>0.2</td>\n",
       "      <td>setosa</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>4.9</td>\n",
       "      <td>3.0</td>\n",
       "      <td>1.4</td>\n",
       "      <td>0.2</td>\n",
       "      <td>setosa</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>4.7</td>\n",
       "      <td>3.2</td>\n",
       "      <td>1.3</td>\n",
       "      <td>0.2</td>\n",
       "      <td>setosa</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>4.6</td>\n",
       "      <td>3.1</td>\n",
       "      <td>1.5</td>\n",
       "      <td>0.2</td>\n",
       "      <td>setosa</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>5.0</td>\n",
       "      <td>3.6</td>\n",
       "      <td>1.4</td>\n",
       "      <td>0.2</td>\n",
       "      <td>setosa</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   sepal_length  sepal_width  petal_length  petal_width species\n",
       "0           5.1          3.5           1.4          0.2  setosa\n",
       "1           4.9          3.0           1.4          0.2  setosa\n",
       "2           4.7          3.2           1.3          0.2  setosa\n",
       "3           4.6          3.1           1.5          0.2  setosa\n",
       "4           5.0          3.6           1.4          0.2  setosa"
      ]
     },
     "metadata": {
      "tags": []
     },
     "output_type": "display_data"
    },
    {
     "data": {
      "text/plain": [
       "0         setosa\n",
       "1         setosa\n",
       "2         setosa\n",
       "3         setosa\n",
       "4         setosa\n",
       "         ...    \n",
       "145    virginica\n",
       "146    virginica\n",
       "147    virginica\n",
       "148    virginica\n",
       "149    virginica\n",
       "Name: species, Length: 150, dtype: category\n",
       "Categories (3, object): ['setosa', 'versicolor', 'virginica']"
      ]
     },
     "metadata": {
      "tags": []
     },
     "output_type": "display_data"
    }
   ],
   "source": [
    "from sklearn.preprocessing import LabelEncoder\n",
    "\n",
    "lbls = LabelEncoder()\n",
    "iris_df = pd.read_csv(\"https://raw.githubusercontent.com/yx1215/Machine_Learning_Dataset/main/iris.csv\")\n",
    "\n",
    "display(iris_df.head())\n",
    "\n",
    "iris_df[\"species\"] = iris_df[\"species\"].astype(\"category\")\n",
    "display(iris_df[\"species\"])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "su7xZLTGAvla"
   },
   "source": [
    "### Label Encoder\n",
    "\n",
    "One way to convert a categorical data into model-understandable numerical data is to assign each category an integer between $0$ and $C$. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 816
    },
    "id": "L_dvqRzHAvla",
    "outputId": "0d904549-4afa-4468-8edb-e9d10bcd559a"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Before encoding: \n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>sepal_length</th>\n",
       "      <th>sepal_width</th>\n",
       "      <th>petal_length</th>\n",
       "      <th>petal_width</th>\n",
       "      <th>species</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>5.1</td>\n",
       "      <td>3.5</td>\n",
       "      <td>1.4</td>\n",
       "      <td>0.2</td>\n",
       "      <td>setosa</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>4.9</td>\n",
       "      <td>3.0</td>\n",
       "      <td>1.4</td>\n",
       "      <td>0.2</td>\n",
       "      <td>setosa</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>4.7</td>\n",
       "      <td>3.2</td>\n",
       "      <td>1.3</td>\n",
       "      <td>0.2</td>\n",
       "      <td>setosa</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>4.6</td>\n",
       "      <td>3.1</td>\n",
       "      <td>1.5</td>\n",
       "      <td>0.2</td>\n",
       "      <td>setosa</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>5.0</td>\n",
       "      <td>3.6</td>\n",
       "      <td>1.4</td>\n",
       "      <td>0.2</td>\n",
       "      <td>setosa</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   sepal_length  sepal_width  petal_length  petal_width species\n",
       "0           5.1          3.5           1.4          0.2  setosa\n",
       "1           4.9          3.0           1.4          0.2  setosa\n",
       "2           4.7          3.2           1.3          0.2  setosa\n",
       "3           4.6          3.1           1.5          0.2  setosa\n",
       "4           5.0          3.6           1.4          0.2  setosa"
      ]
     },
     "metadata": {
      "tags": []
     },
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>sepal_length</th>\n",
       "      <th>sepal_width</th>\n",
       "      <th>petal_length</th>\n",
       "      <th>petal_width</th>\n",
       "      <th>species</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>145</th>\n",
       "      <td>6.7</td>\n",
       "      <td>3.0</td>\n",
       "      <td>5.2</td>\n",
       "      <td>2.3</td>\n",
       "      <td>virginica</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>146</th>\n",
       "      <td>6.3</td>\n",
       "      <td>2.5</td>\n",
       "      <td>5.0</td>\n",
       "      <td>1.9</td>\n",
       "      <td>virginica</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>147</th>\n",
       "      <td>6.5</td>\n",
       "      <td>3.0</td>\n",
       "      <td>5.2</td>\n",
       "      <td>2.0</td>\n",
       "      <td>virginica</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>148</th>\n",
       "      <td>6.2</td>\n",
       "      <td>3.4</td>\n",
       "      <td>5.4</td>\n",
       "      <td>2.3</td>\n",
       "      <td>virginica</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>149</th>\n",
       "      <td>5.9</td>\n",
       "      <td>3.0</td>\n",
       "      <td>5.1</td>\n",
       "      <td>1.8</td>\n",
       "      <td>virginica</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     sepal_length  sepal_width  petal_length  petal_width    species\n",
       "145           6.7          3.0           5.2          2.3  virginica\n",
       "146           6.3          2.5           5.0          1.9  virginica\n",
       "147           6.5          3.0           5.2          2.0  virginica\n",
       "148           6.2          3.4           5.4          2.3  virginica\n",
       "149           5.9          3.0           5.1          1.8  virginica"
      ]
     },
     "metadata": {
      "tags": []
     },
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "After encoding: \n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>sepal_length</th>\n",
       "      <th>sepal_width</th>\n",
       "      <th>petal_length</th>\n",
       "      <th>petal_width</th>\n",
       "      <th>species</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>5.1</td>\n",
       "      <td>3.5</td>\n",
       "      <td>1.4</td>\n",
       "      <td>0.2</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>4.9</td>\n",
       "      <td>3.0</td>\n",
       "      <td>1.4</td>\n",
       "      <td>0.2</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>4.7</td>\n",
       "      <td>3.2</td>\n",
       "      <td>1.3</td>\n",
       "      <td>0.2</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>4.6</td>\n",
       "      <td>3.1</td>\n",
       "      <td>1.5</td>\n",
       "      <td>0.2</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>5.0</td>\n",
       "      <td>3.6</td>\n",
       "      <td>1.4</td>\n",
       "      <td>0.2</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   sepal_length  sepal_width  petal_length  petal_width  species\n",
       "0           5.1          3.5           1.4          0.2        0\n",
       "1           4.9          3.0           1.4          0.2        0\n",
       "2           4.7          3.2           1.3          0.2        0\n",
       "3           4.6          3.1           1.5          0.2        0\n",
       "4           5.0          3.6           1.4          0.2        0"
      ]
     },
     "metadata": {
      "tags": []
     },
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>sepal_length</th>\n",
       "      <th>sepal_width</th>\n",
       "      <th>petal_length</th>\n",
       "      <th>petal_width</th>\n",
       "      <th>species</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>145</th>\n",
       "      <td>6.7</td>\n",
       "      <td>3.0</td>\n",
       "      <td>5.2</td>\n",
       "      <td>2.3</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>146</th>\n",
       "      <td>6.3</td>\n",
       "      <td>2.5</td>\n",
       "      <td>5.0</td>\n",
       "      <td>1.9</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>147</th>\n",
       "      <td>6.5</td>\n",
       "      <td>3.0</td>\n",
       "      <td>5.2</td>\n",
       "      <td>2.0</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>148</th>\n",
       "      <td>6.2</td>\n",
       "      <td>3.4</td>\n",
       "      <td>5.4</td>\n",
       "      <td>2.3</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>149</th>\n",
       "      <td>5.9</td>\n",
       "      <td>3.0</td>\n",
       "      <td>5.1</td>\n",
       "      <td>1.8</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     sepal_length  sepal_width  petal_length  petal_width  species\n",
       "145           6.7          3.0           5.2          2.3        2\n",
       "146           6.3          2.5           5.0          1.9        2\n",
       "147           6.5          3.0           5.2          2.0        2\n",
       "148           6.2          3.4           5.4          2.3        2\n",
       "149           5.9          3.0           5.1          1.8        2"
      ]
     },
     "metadata": {
      "tags": []
     },
     "output_type": "display_data"
    },
    {
     "data": {
      "text/plain": [
       "dtype('int64')"
      ]
     },
     "execution_count": 7,
     "metadata": {
      "tags": []
     },
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from sklearn.preprocessing import LabelEncoder\n",
    "\n",
    "lbls = LabelEncoder()\n",
    "iris_df = pd.read_csv(\"https://raw.githubusercontent.com/yx1215/Machine_Learning_Dataset/main/iris.csv\")\n",
    "\n",
    "print(\"Before encoding: \")\n",
    "display(iris_df.head())\n",
    "display(iris_df.tail())\n",
    "iris_df[\"species\"] = lbls.fit_transform(iris_df[\"species\"])\n",
    "\n",
    "print(\"After encoding: \")\n",
    "display(iris_df.head())\n",
    "display(iris_df.tail())\n",
    "iris_df[\"species\"].dtype"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "m3eSz-PiAvlb"
   },
   "source": [
    "It is worth noting that label encoding introduces a new problem. The different numbers assigned to each category will make the model misunderstand the data to be in some kind of order, since $0 < 1 < 2 < ... < C$. But this is not the case for non-ordinal data. Therefore, label encoding is only recommended if ordering strongly affects the relationship, or when there are too many categories and not enough data (see later)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "ZB-QCdzXAvlb"
   },
   "source": [
    "### One-hot Encoder\n",
    "Some algorithms (deep learning in particular) require a different encoding called **one-hot-encoding** (a.k.a. **dummy encoding**, **indicator variables**). In one-hot-encoding we convert each category value into a new column of a vector and assign to that column a 1/0 value. E.g.:\n",
    "\n",
    "<img src=\"./One_hot.png\" width=\"50%\">\n",
    "\n",
    "In this example, each row is converted into a vector of length 3 in which the first position indicates“Small”, the second“Medium”and the last“Large”. It is generally considered good practice to remove one variable to avoid issues of co-linearity even though this does not really matter as much in machine learning. Both pandas and sklearn have functions to do this for you:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 510
    },
    "id": "OY97t_11Avlc",
    "outputId": "bd5f2443-9a20-4fb4-e652-0c7a32fa337c"
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>sepal_length</th>\n",
       "      <th>sepal_width</th>\n",
       "      <th>petal_length</th>\n",
       "      <th>petal_width</th>\n",
       "      <th>species</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>5.1</td>\n",
       "      <td>3.5</td>\n",
       "      <td>1.4</td>\n",
       "      <td>0.2</td>\n",
       "      <td>setosa</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>4.9</td>\n",
       "      <td>3.0</td>\n",
       "      <td>1.4</td>\n",
       "      <td>0.2</td>\n",
       "      <td>setosa</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>4.7</td>\n",
       "      <td>3.2</td>\n",
       "      <td>1.3</td>\n",
       "      <td>0.2</td>\n",
       "      <td>setosa</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>4.6</td>\n",
       "      <td>3.1</td>\n",
       "      <td>1.5</td>\n",
       "      <td>0.2</td>\n",
       "      <td>setosa</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>5.0</td>\n",
       "      <td>3.6</td>\n",
       "      <td>1.4</td>\n",
       "      <td>0.2</td>\n",
       "      <td>setosa</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   sepal_length  sepal_width  petal_length  petal_width species\n",
       "0           5.1          3.5           1.4          0.2  setosa\n",
       "1           4.9          3.0           1.4          0.2  setosa\n",
       "2           4.7          3.2           1.3          0.2  setosa\n",
       "3           4.6          3.1           1.5          0.2  setosa\n",
       "4           5.0          3.6           1.4          0.2  setosa"
      ]
     },
     "metadata": {
      "tags": []
     },
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>sepal_length</th>\n",
       "      <th>sepal_width</th>\n",
       "      <th>petal_length</th>\n",
       "      <th>petal_width</th>\n",
       "      <th>species</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>145</th>\n",
       "      <td>6.7</td>\n",
       "      <td>3.0</td>\n",
       "      <td>5.2</td>\n",
       "      <td>2.3</td>\n",
       "      <td>virginica</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>146</th>\n",
       "      <td>6.3</td>\n",
       "      <td>2.5</td>\n",
       "      <td>5.0</td>\n",
       "      <td>1.9</td>\n",
       "      <td>virginica</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>147</th>\n",
       "      <td>6.5</td>\n",
       "      <td>3.0</td>\n",
       "      <td>5.2</td>\n",
       "      <td>2.0</td>\n",
       "      <td>virginica</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>148</th>\n",
       "      <td>6.2</td>\n",
       "      <td>3.4</td>\n",
       "      <td>5.4</td>\n",
       "      <td>2.3</td>\n",
       "      <td>virginica</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>149</th>\n",
       "      <td>5.9</td>\n",
       "      <td>3.0</td>\n",
       "      <td>5.1</td>\n",
       "      <td>1.8</td>\n",
       "      <td>virginica</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     sepal_length  sepal_width  petal_length  petal_width    species\n",
       "145           6.7          3.0           5.2          2.3  virginica\n",
       "146           6.3          2.5           5.0          1.9  virginica\n",
       "147           6.5          3.0           5.2          2.0  virginica\n",
       "148           6.2          3.4           5.4          2.3  virginica\n",
       "149           5.9          3.0           5.1          1.8  virginica"
      ]
     },
     "metadata": {
      "tags": []
     },
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "150\n",
      "(150, 3)\n",
      "[[1. 0. 0.]\n",
      " [1. 0. 0.]\n",
      " [1. 0. 0.]\n",
      " [1. 0. 0.]\n",
      " [1. 0. 0.]]\n"
     ]
    }
   ],
   "source": [
    "from sklearn.preprocessing import OneHotEncoder\n",
    "\n",
    "lbls = OneHotEncoder()\n",
    "iris_df = pd.read_csv(\"https://raw.githubusercontent.com/yx1215/Machine_Learning_Dataset/main/iris.csv\")\n",
    "\n",
    "display(iris_df.head())\n",
    "display(iris_df.tail())\n",
    "\n",
    "print(iris_df[\"species\"].size)\n",
    "out = lbls.fit_transform(iris_df[[\"species\"]]).toarray()\n",
    "print(out.shape)\n",
    "print(out[:5, :])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "Qt-w2-REAvlc"
   },
   "source": [
    "Alternatively, you could use pd.get_dummies()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 419
    },
    "id": "wbwm8-X0Avlc",
    "outputId": "81de8205-d422-498c-e46f-3a864618d2fd"
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>setosa</th>\n",
       "      <th>versicolor</th>\n",
       "      <th>virginica</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>145</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>146</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>147</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>148</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>149</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>150 rows × 3 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "     setosa  versicolor  virginica\n",
       "0         1           0          0\n",
       "1         1           0          0\n",
       "2         1           0          0\n",
       "3         1           0          0\n",
       "4         1           0          0\n",
       "..      ...         ...        ...\n",
       "145       0           0          1\n",
       "146       0           0          1\n",
       "147       0           0          1\n",
       "148       0           0          1\n",
       "149       0           0          1\n",
       "\n",
       "[150 rows x 3 columns]"
      ]
     },
     "execution_count": 9,
     "metadata": {
      "tags": []
     },
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import pandas as pd\n",
    "pd.get_dummies(iris_df[\"species\"])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "JnecA9gNAvld"
   },
   "source": [
    "However, it is often not suitable to apply one-hot encoding when there are too many categories. Every category level would create a new variable and therefore increase potential variance problems, label encoding reduces this problem."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "OrX92tdnAvld"
   },
   "source": [
    "## Normalizing continuous data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "MTql2fGaAvld"
   },
   "source": [
    "Continuous data take on a continuous range of values (e.g., temperature, wind-speed or sales per day). Most machine learning algorithms (all which use some kind of gradient descent) require that the data is somehow scaled. These scaling methods ensure that the data is spread around zero. E.g., the min-max scaling procedure applies the following formula to the continuous columns of a dataset:\n",
    "\n",
    "\\begin{equation*}\n",
    "x_j \\leftarrow{} \\frac{x_j - \\min X_j}{\\max X_j - \\min X_j}\n",
    "\\end{equation*}\n",
    "\n",
    "If the data is normally distributed, you may want to use standardization instead:\n",
    "\n",
    "\\begin{equation*}\n",
    "x_j \\leftarrow \\frac{x_j - \\mu x_j}{\\sigma_{X_j}}\n",
    "\\end{equation*}\n",
    "\n",
    "Sometimes you will see that people try to ensure that the l1 or l2 norm of the features are 1. In practice the difference between all of these methods is minute, but it is crucially important that you do use some kind of normalization. In python:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "gHEK-QCuAvld",
    "outputId": "f23e8090-0997-4db3-c215-6e3b1f7fef07"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Ellipsis"
      ]
     },
     "execution_count": 10,
     "metadata": {
      "tags": []
     },
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from sklearn.preprocessing import MinMaxScaler, Normalizer, MaxAbsScaler\n",
    "\n",
    "scaler = MinMaxScaler().fit(X_train)\n",
    "X_train = scaler.transform(X_train)\n",
    "X_test = scaler.transform(X_test)\n",
    "\n",
    "scaler = Normalizer().fit(X_train)\n",
    "X_train = scaler.transform(X_train)\n",
    "X_test = scaler.transform(X_test)\n",
    "\n",
    "..."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "PpsDqfMSAvle"
   },
   "source": [
    "## Time series data\n",
    "Time-series are best converted to some standard format such as pd.datetime. It is often beneficial to generate extra cyclical variables based on the data such as:\n",
    "* day of the week\n",
    "* week number\n",
    "* month\n",
    "* quarter\n",
    "* year\n",
    "* ...\n",
    "\n",
    "These can be treated as categorical and signal to the machine learning algorithm that it may have to handle instances coming from different periods completely different. It is as if you are giving the machine learning model all sorts of interesting extra time-series data to play with. These can be added automatically in using standard pandas code):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "iH6sc50QAvle",
    "outputId": "ed07b225-f48b-4e71-d0cd-0e1241f64b97"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0    3/11/2000\n",
       "1    3/12/2000\n",
       "2    3/13/2000\n",
       "3    12/2/2020\n",
       "4     7/4/2021\n",
       "dtype: object"
      ]
     },
     "execution_count": 11,
     "metadata": {
      "tags": []
     },
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#Create some fake date data\n",
    "s = pd.Series(['3/11/2000', '3/12/2000', '3/13/2000', \"12/2/2020\", \"7/4/2021\"])\n",
    "s.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "-X1xp6zkAvle",
    "outputId": "64467362-06c5-4380-8af7-d71772ec8879"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0   2000-03-11\n",
       "1   2000-03-12\n",
       "2   2000-03-13\n",
       "3   2020-12-02\n",
       "4   2021-07-04\n",
       "dtype: datetime64[ns]"
      ]
     },
     "execution_count": 12,
     "metadata": {
      "tags": []
     },
     "output_type": "execute_result"
    }
   ],
   "source": [
    "s = pd.to_datetime(s, infer_datetime_format=True)\n",
    "# s = pd.to_datetime(s, format=\"%m/%d/%Y\") ##alternatively you could specify the format\n",
    "s.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 425
    },
    "id": "b2nNPmc5Avlf",
    "outputId": "05c29362-d9fb-4ded-e18a-6ae183fd66bf",
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0    2000\n",
       "1    2000\n",
       "2    2000\n",
       "3    2020\n",
       "4    2021\n",
       "dtype: int64"
      ]
     },
     "metadata": {
      "tags": []
     },
     "output_type": "display_data"
    },
    {
     "data": {
      "text/plain": [
       "0    1\n",
       "1    1\n",
       "2    1\n",
       "3    4\n",
       "4    3\n",
       "dtype: int64"
      ]
     },
     "metadata": {
      "tags": []
     },
     "output_type": "display_data"
    },
    {
     "data": {
      "text/plain": [
       "0    11\n",
       "1    12\n",
       "2    13\n",
       "3     2\n",
       "4     4\n",
       "dtype: int64"
      ]
     },
     "metadata": {
      "tags": []
     },
     "output_type": "display_data"
    },
    {
     "data": {
      "text/plain": [
       "0    5\n",
       "1    6\n",
       "2    0\n",
       "3    2\n",
       "4    6\n",
       "dtype: int64"
      ]
     },
     "metadata": {
      "tags": []
     },
     "output_type": "display_data"
    }
   ],
   "source": [
    "# Year\n",
    "display(s.dt.year)\n",
    "\n",
    "# Quarter\n",
    "display(s.dt.quarter)\n",
    "\n",
    "# Day\n",
    "display(s.dt.day)\n",
    "\n",
    "# Weekday, 0 stands for Monday, 6 stands for Sunday\n",
    "display(s.dt.weekday)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "xSJqim3QAvlf",
    "outputId": "68489936-3e02-4928-b5af-d0930a594748"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1677-09-21 00:12:43.145225\n",
      "2262-04-11 23:47:16.854775807\n"
     ]
    }
   ],
   "source": [
    "print(pd.Timestamp.min)\n",
    "print(pd.Timestamp.max)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "ZlIUpnt1Avlf"
   },
   "source": [
    "## Writing custom preprocessing steps\n",
    "It’s also possible (and often needed) to write your own preprocessing steps. In sklearn any function that transforms the data is called a transformer. By design, sklearn does not rely on inheritance, but on duck typing which means that all you need to do, is implement a class with the three required methods: fit() and transform() and fit_transform() (you can get the last one for free by extending TransformerMixin). These take as input a numpy array or matrix. If you want to parametrize your preprocessing step, you can do so by extending the BaseEstimator class. This will give you two extra methods for free: get_params() and set_params(). For instance, you could recreate a standardization preprocessing step by implementing it as shown below:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true,
    "id": "LMsh7rFUAvlg"
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "from sklearn.base import BaseEstimator, TransformerMixin\n",
    "\n",
    "class MyStandardScaler(BaseEstimator, TransformerMixin):\n",
    "    def fit(self, X, y=None):\n",
    "        self.mean_ = np.mean(X,axis=0)\n",
    "        self.std_ = np.std(X, axis=0)\n",
    "        return self\n",
    "    def transform(self, X, y=\"None\", copy=None):\n",
    "        X = (X-self.mean_)/self.std_\n",
    "        return X\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "d79YQypBAvlg"
   },
   "source": [
    "As a more advanced example, consider the following preprocesser which takes a datetime object as input and converts it to the day of the week:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true,
    "id": "J7ONjeOLAvli"
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "from sklearn.base import BaseEstimator, TransformerMixin\n",
    "\n",
    "class DayOfWeek(BaseEstimator, TransformerMixin):\n",
    "    def __init__(self, date_iloc = 0):\n",
    "        self.date_iloc = date_iloc\n",
    "        \n",
    "    def fit(self, X, y=None):\n",
    "        return self\n",
    "    def transform(self, X, y=None):\n",
    "        datetimes = pd.to_datetime(X[:, self.date_iloc], infer_datetime_format=True)\n",
    "        dayofweek = np.array(datetimes.dayofweek)\n",
    "        return np.c_[X,dayofweek]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "TrVn8P7bAvlj"
   },
   "source": [
    "Note that while we use pandas in the above code, this is actually not needed. Generally, you will want to input and output numpy matrices/arrays. In this case we use pandas because it helps us to easily extract the day of week variable that we are looking for."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "W7vDPznDAvlj",
    "outputId": "2e51737f-9c6f-4358-d5ba-a827a734a1dd"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[['3/11/2000']\n",
      " ['3/12/2000']\n",
      " ['3/13/2000']\n",
      " ['12/2/2020']\n",
      " ['7/4/2021']]\n",
      "[['3/11/2000' 5]\n",
      " ['3/12/2000' 6]\n",
      " ['3/13/2000' 0]\n",
      " ['12/2/2020' 2]\n",
      " ['7/4/2021' 6]]\n"
     ]
    }
   ],
   "source": [
    "s = pd.DataFrame([['3/11/2000'], ['3/12/2000'], ['3/13/2000'], [\"12/2/2020\"], [\"7/4/2021\"]]).values\n",
    "print(s)\n",
    "SS = DayOfWeek()\n",
    "\n",
    "s = SS.fit_transform(s)\n",
    "print(s)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "SBvrNKKMAvlk",
    "outputId": "0850a789-eb2e-4949-ad22-ad6f9df8b355"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{'date_iloc': 0}\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "{'date_iloc': 1}"
      ]
     },
     "execution_count": 18,
     "metadata": {
      "tags": []
     },
     "output_type": "execute_result"
    }
   ],
   "source": [
    "print(SS.get_params())\n",
    "params = {\"date_iloc\": 1}\n",
    "SS.set_params(**params)\n",
    "SS.get_params()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "30dGPujXAvlk"
   },
   "source": [
    "## Exercise\n",
    "Write a transformer such that each column has l2 norm $k$ after transformation. Here $k$ should be a parameter of your transformer."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true,
    "id": "YSRqGcDvAvlk"
   },
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true,
    "id": "YiSDCZEGAvll"
   },
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true,
    "id": "orArKj7VAvll"
   },
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true,
    "id": "bJkTADGpAvll"
   },
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true,
    "id": "Qwxn9s2RAvlm"
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "colab": {
   "collapsed_sections": [],
   "name": "Data Preprocessing.ipynb",
   "provenance": []
  },
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}
